OCR exploration server

Warning: this site is under development!
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

The Impact of OCR Accuracy and Feature Transformation on Automatic Text Classification

Internal identifier: 000F97 (Main/Exploration); previous: 000F96; next: 000F98

Authors: Mayo Murata [Japan]; P. Busagala [Japan]; Wataru Ohyama [Japan]; Tetsushi Wakabayashi [Japan]; Fumitaka Kimura [Japan]

RBID : ISTEX:242F41C85B44E90694E34C3FA935F14E48BF6255

Abstract

Digitizing printed documents involves generating text with an OCR system for applications such as full-text retrieval and document organization. However, text produced by current OCR technology contains errors. Moreover, previous studies have shown that classification performance decreases as OCR accuracy decreases. The reason is the use of absolute word frequency as the feature vector. Representing OCR text by absolute word frequency makes the features depend on text length and on the word recognition rate, which increases within-class variance and consequently lowers classification performance. We describe feature transformation techniques that do not have these limitations and present improved experimental results for all classifiers used.
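
The dependency on text length mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not name the feature transformations the authors use, so the relative frequency shown here is only an assumed example of a length-independent feature, not the paper's confirmed method. The function names and the sample text are hypothetical.

```python
from collections import Counter


def absolute_frequency(text):
    """Count how many times each whitespace-separated word occurs."""
    return Counter(text.lower().split())


def relative_frequency(text):
    """Divide each word count by the total number of words.

    Assumed illustration of a length-independent feature; the source
    abstract does not specify which transformations the authors use.
    """
    counts = absolute_frequency(text)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()} if total else {}


# Hypothetical sample: the same content at two different lengths.
short_doc = "ocr text ocr error"
long_doc = " ".join([short_doc] * 3)

abs_short = absolute_frequency(short_doc)
abs_long = absolute_frequency(long_doc)
rel_short = relative_frequency(short_doc)
rel_long = relative_frequency(long_doc)

# Absolute counts grow with document length; relative ones do not.
print(abs_short["ocr"], abs_long["ocr"])
print(rel_short["ocr"], rel_long["ocr"])
```

In this sketch two documents with the same word distribution get different absolute-frequency vectors but identical relative-frequency vectors, which is one way such a transformation could reduce within-class variance.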

DOI: 10.1007/11669487_45


The document in XML format

<record>
<TEI wicri:istexFullTextTei="biblStruct">
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">The Impact of OCR Accuracy and Feature Transformation on Automatic Text Classification</title>
<author>
<name sortKey="Murata, Mayo" sort="Murata, Mayo" uniqKey="Murata M" first="Mayo" last="Murata">Mayo Murata</name>
</author>
<author>
<name sortKey="Busagala, P" sort="Busagala, P" uniqKey="Busagala P" first="P." last="Busagala">P. Busagala</name>
</author>
<author>
<name sortKey="Ohyama, Wataru" sort="Ohyama, Wataru" uniqKey="Ohyama W" first="Wataru" last="Ohyama">Wataru Ohyama</name>
</author>
<author>
<name sortKey="Wakabayashi, Tetsushi" sort="Wakabayashi, Tetsushi" uniqKey="Wakabayashi T" first="Tetsushi" last="Wakabayashi">Tetsushi Wakabayashi</name>
</author>
<author>
<name sortKey="Kimura, Fumitaka" sort="Kimura, Fumitaka" uniqKey="Kimura F" first="Fumitaka" last="Kimura">Fumitaka Kimura</name>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">ISTEX</idno>
<idno type="RBID">ISTEX:242F41C85B44E90694E34C3FA935F14E48BF6255</idno>
<date when="2006" year="2006">2006</date>
<idno type="doi">10.1007/11669487_45</idno>
<idno type="url">https://api.istex.fr/document/242F41C85B44E90694E34C3FA935F14E48BF6255/fulltext/pdf</idno>
<idno type="wicri:Area/Istex/Corpus">000060</idno>
<idno type="wicri:Area/Istex/Curation">000059</idno>
<idno type="wicri:Area/Istex/Checkpoint">000970</idno>
<idno type="wicri:doubleKey">0302-9743:2006:Murata M:the:impact:of</idno>
<idno type="wicri:Area/Main/Merge">001014</idno>
<idno type="wicri:source">INIST</idno>
<idno type="RBID">Pascal:08-0029071</idno>
<idno type="wicri:Area/PascalFrancis/Corpus">000299</idno>
<idno type="wicri:Area/PascalFrancis/Curation">000485</idno>
<idno type="wicri:Area/PascalFrancis/Checkpoint">000303</idno>
<idno type="wicri:doubleKey">0302-9743:2006:Murata M:the:impact:of</idno>
<idno type="wicri:Area/Main/Merge">001164</idno>
<idno type="wicri:Area/Main/Curation">000F97</idno>
<idno type="wicri:Area/Main/Exploration">000F97</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title level="a" type="main" xml:lang="en">The Impact of OCR Accuracy and Feature Transformation on Automatic Text Classification</title>
<author>
<name sortKey="Murata, Mayo" sort="Murata, Mayo" uniqKey="Murata M" first="Mayo" last="Murata">Mayo Murata</name>
<affiliation wicri:level="1">
<country xml:lang="fr">Japon</country>
<wicri:regionArea>Faculty of Engineering, Mie University, 1577 Kurimamachiya-cho, Tsu-shi, 5148507, Mie</wicri:regionArea>
<wicri:noRegion>Mie</wicri:noRegion>
</affiliation>
<affiliation wicri:level="1">
<country wicri:rule="url">Japon</country>
</affiliation>
</author>
<author>
<name sortKey="Busagala, P" sort="Busagala, P" uniqKey="Busagala P" first="P." last="Busagala">P. Busagala</name>
<affiliation wicri:level="1">
<country xml:lang="fr">Japon</country>
<wicri:regionArea>Faculty of Engineering, Mie University, 1577 Kurimamachiya-cho, Tsu-shi, 5148507, Mie</wicri:regionArea>
<wicri:noRegion>Mie</wicri:noRegion>
</affiliation>
<affiliation wicri:level="1">
<country wicri:rule="url">Japon</country>
</affiliation>
</author>
<author>
<name sortKey="Ohyama, Wataru" sort="Ohyama, Wataru" uniqKey="Ohyama W" first="Wataru" last="Ohyama">Wataru Ohyama</name>
<affiliation wicri:level="1">
<country xml:lang="fr">Japon</country>
<wicri:regionArea>Faculty of Engineering, Mie University, 1577 Kurimamachiya-cho, Tsu-shi, 5148507, Mie</wicri:regionArea>
<wicri:noRegion>Mie</wicri:noRegion>
</affiliation>
<affiliation wicri:level="1">
<country wicri:rule="url">Japon</country>
</affiliation>
</author>
<author>
<name sortKey="Wakabayashi, Tetsushi" sort="Wakabayashi, Tetsushi" uniqKey="Wakabayashi T" first="Tetsushi" last="Wakabayashi">Tetsushi Wakabayashi</name>
<affiliation wicri:level="1">
<country xml:lang="fr">Japon</country>
<wicri:regionArea>Faculty of Engineering, Mie University, 1577 Kurimamachiya-cho, Tsu-shi, 5148507, Mie</wicri:regionArea>
<wicri:noRegion>Mie</wicri:noRegion>
</affiliation>
<affiliation wicri:level="1">
<country wicri:rule="url">Japon</country>
</affiliation>
</author>
<author>
<name sortKey="Kimura, Fumitaka" sort="Kimura, Fumitaka" uniqKey="Kimura F" first="Fumitaka" last="Kimura">Fumitaka Kimura</name>
<affiliation wicri:level="1">
<country xml:lang="fr">Japon</country>
<wicri:regionArea>Faculty of Engineering, Mie University, 1577 Kurimamachiya-cho, Tsu-shi, 5148507, Mie</wicri:regionArea>
<wicri:noRegion>Mie</wicri:noRegion>
</affiliation>
<affiliation wicri:level="1">
<country wicri:rule="url">Japon</country>
</affiliation>
</author>
</analytic>
<monogr></monogr>
<series>
<title level="s">Lecture Notes in Computer Science</title>
<imprint>
<date>2006</date>
</imprint>
<idno type="ISSN">0302-9743</idno>
<idno type="eISSN">1611-3349</idno>
<idno type="ISSN">0302-9743</idno>
</series>
<idno type="istex">242F41C85B44E90694E34C3FA935F14E48BF6255</idno>
<idno type="DOI">10.1007/11669487_45</idno>
<idno type="ChapterID">45</idno>
<idno type="ChapterID">Chap45</idno>
</biblStruct>
</sourceDesc>
<seriesStmt>
<idno type="ISSN">0302-9743</idno>
</seriesStmt>
</fileDesc>
<profileDesc>
<textClass>
<keywords scheme="KwdEn" xml:lang="en">
<term>Automatic classification</term>
<term>Character recognition</term>
<term>Content analysis</term>
<term>Digitizing</term>
<term>Document analysis</term>
<term>Document retrieval</term>
<term>Document structure</term>
<term>Full text</term>
<term>Image processing</term>
<term>Information retrieval</term>
<term>Natural language</term>
<term>Optical character recognition</term>
<term>Pattern recognition</term>
<term>Printed character</term>
<term>Printed document</term>
<term>Variance</term>
<term>Word</term>
<term>Word length</term>
</keywords>
<keywords scheme="Pascal" xml:lang="fr">
<term>Analyse contenu</term>
<term>Analyse documentaire</term>
<term>Caractère imprimé</term>
<term>Classification automatique</term>
<term>Document imprimé</term>
<term>Langage naturel</term>
<term>Longueur mot</term>
<term>Mot</term>
<term>Numérisation</term>
<term>Recherche documentaire</term>
<term>Recherche information</term>
<term>Reconnaissance caractère</term>
<term>Reconnaissance forme</term>
<term>Reconnaissance optique caractère</term>
<term>Structure document</term>
<term>Texte intégral</term>
<term>Traitement image</term>
<term>Variance</term>
</keywords>
<keywords scheme="Wicri" type="topic" xml:lang="fr">
<term>Numérisation</term>
<term>Recherche documentaire</term>
</keywords>
</textClass>
<langUsage>
<language ident="en">en</language>
</langUsage>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">Abstract: Digitization process of various printed documents involves generating texts by an OCR system for different applications including full-text retrieval and document organizations. However, OCR-generated texts have errors as per present OCR technology. Moreover, previous studies have revealed that as OCR accuracy decreases the classification performance also decreases. The reason for this is the use of absolute word frequency as feature vector. Representing OCR texts using absolute word frequency has limitations such as dependency on text length and word recognition rate consequently lower classification performance due to higher within-class variances. We describe feature transformation techniques which do not have such limitations and present improved experimental results from all used classifiers.</div>
</front>
</TEI>
<affiliations>
<list>
<country>
<li>Japon</li>
</country>
</list>
<tree>
<country name="Japon">
<noRegion>
<name sortKey="Murata, Mayo" sort="Murata, Mayo" uniqKey="Murata M" first="Mayo" last="Murata">Mayo Murata</name>
</noRegion>
<name sortKey="Busagala, P" sort="Busagala, P" uniqKey="Busagala P" first="P." last="Busagala">P. Busagala</name>
<name sortKey="Busagala, P" sort="Busagala, P" uniqKey="Busagala P" first="P." last="Busagala">P. Busagala</name>
<name sortKey="Kimura, Fumitaka" sort="Kimura, Fumitaka" uniqKey="Kimura F" first="Fumitaka" last="Kimura">Fumitaka Kimura</name>
<name sortKey="Kimura, Fumitaka" sort="Kimura, Fumitaka" uniqKey="Kimura F" first="Fumitaka" last="Kimura">Fumitaka Kimura</name>
<name sortKey="Murata, Mayo" sort="Murata, Mayo" uniqKey="Murata M" first="Mayo" last="Murata">Mayo Murata</name>
<name sortKey="Ohyama, Wataru" sort="Ohyama, Wataru" uniqKey="Ohyama W" first="Wataru" last="Ohyama">Wataru Ohyama</name>
<name sortKey="Ohyama, Wataru" sort="Ohyama, Wataru" uniqKey="Ohyama W" first="Wataru" last="Ohyama">Wataru Ohyama</name>
<name sortKey="Wakabayashi, Tetsushi" sort="Wakabayashi, Tetsushi" uniqKey="Wakabayashi T" first="Tetsushi" last="Wakabayashi">Tetsushi Wakabayashi</name>
<name sortKey="Wakabayashi, Tetsushi" sort="Wakabayashi, Tetsushi" uniqKey="Wakabayashi T" first="Tetsushi" last="Wakabayashi">Tetsushi Wakabayashi</name>
</country>
</tree>
</affiliations>
</record>
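
As a rough sketch of how the identifiers in the record above might be read programmatically: the record as displayed uses the `wicri:` prefix without a visible namespace declaration, so a strict XML parser may reject it. The example below therefore assumes a simple regular-expression scan over `<idno>` elements instead. The excerpt, the pattern, and the function name are hypothetical and are not part of Dilib.

```python
import re

# Hypothetical excerpt copied from the record above.
RECORD_EXCERPT = '''
<idno type="wicri:source">ISTEX</idno>
<idno type="doi">10.1007/11669487_45</idno>
<idno type="wicri:Area/Main/Exploration">000F97</idno>
'''

# Matches single-line <idno type="...">value</idno> elements only.
IDNO_PATTERN = re.compile(r'<idno type="([^"]+)">([^<]*)</idno>')


def extract_idnos(record_text):
    """Return a dict mapping each idno type to the list of its values."""
    result = {}
    for idno_type, value in IDNO_PATTERN.findall(record_text):
        result.setdefault(idno_type, []).append(value)
    return result


idnos = extract_idnos(RECORD_EXCERPT)
print(idnos["doi"])
```

Values are collected in lists because the full record repeats some `type` attributes, for example `RBID` and `wicri:source`.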

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Exploration
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000F97 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 000F97 | SxmlIndent | more

To add a link to this page in the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    Main
   |étape=   Exploration
   |type=    RBID
   |clé=     ISTEX:242F41C85B44E90694E34C3FA935F14E48BF6255
   |texte=   The Impact of OCR Accuracy and Feature Transformation on Automatic Text Classification
}}

Wicri

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024